Topic modeling is an unsupervised method used to deduce the abstract topics discussed across a collection of documents. Since the aim of our project is to classify documents, we used topic modeling as a means to label our data. Once we have the labeled data, unseen test documents are classified based on their topic probabilities.
The first step in performing LDA is to deduce the optimal number of topics. This is achieved using the perplexity measure. Since every topic is represented by a probability distribution, we need to measure how well these distributions predict a sample, and perplexity does exactly that: the lower the perplexity, the better the model predicts the sample. Perplexity was computed for LDA models with k ranging from 10 to 30, for both the Bag of Words model and the TF-IDF model. The LDA model with the lowest perplexity is deemed to be the best model, and its k is deemed to be the optimal number of topics.
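As a rough illustration of what the perplexity measure computes, here is a minimal base R sketch. It assumes dense matrices and toy values; the function name `perplexity_lda` and the two toy "models" are hypothetical and are not part of our pipeline.

```r
# Perplexity of a topic model on a document-term matrix (base R, dense matrices).
# theta: documents x topics, rows sum to 1, P(topic | document)
# phi:   topics x terms,     rows sum to 1, P(term | topic)
# dtm:   documents x terms,  raw term counts
perplexity_lda <- function(dtm, theta, phi) {
  p_word <- theta %*% phi                  # P(term | document)
  observed <- dtm > 0                      # skip zero counts to avoid 0 * log(0)
  log_lik <- sum(dtm[observed] * log(p_word[observed]))
  exp(-log_lik / sum(dtm))                 # lower is better
}

# Toy corpus: 2 documents, 4 terms
dtm <- matrix(c(2, 1, 0, 1,
                0, 3, 1, 0), nrow = 2, byrow = TRUE)
theta <- matrix(0.5, nrow = 2, ncol = 2)

# Two toy stand-ins for candidate models (not fitted LDA models)
phi_uniform <- matrix(0.25, nrow = 2, ncol = 4)
phi_sharp <- matrix(c(0.4, 0.4, 0.1, 0.1,
                      0.1, 0.4, 0.4, 0.1), nrow = 2, byrow = TRUE)

scores <- c(uniform = perplexity_lda(dtm, theta, phi_uniform),
            sharp   = perplexity_lda(dtm, theta, phi_sharp))

# Model selection: keep the candidate with the lowest perplexity
best <- names(which.min(scores))
```

A model that spreads probability uniformly over V terms has perplexity V, which makes a handy sanity check. Choosing k follows the same pattern: compute one score per candidate k and keep the smallest.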
…
In our case, the best model turned out to be the Bag of Words model, with k = 25 topics.
The LDA model was built on term frequencies with k = 25 topics. Both Gibbs sampling and the dot product method were used for this purpose.
Model for term frequency with Gibbs sampling
LDA_model_bow <- FitLdaModel(dtm = sparse_matrix_dtm_bow, k = as.integer(i),
iterations = 200, burnin = 175)
p1_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow),],
method = "gibbs",iterations = 200, burnin = 175)
p2_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow),], method = "dot")
Model for term frequency with the dot product method
LDA_model_bow <- FitLdaModel(dtm = sparse_matrix_dtm_bow, k = as.integer(i),
iterations = 200, burnin = 175)
p1_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow),],
method = "dot",iterations = 200, burnin = 175)
p2_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow),], method = "dot")
Each document now has a probability associated with each of the 25 topics. These topic probabilities act as the labeled data for further prediction.
Once we have all the documents in the training set labeled, the next step is predicting the topic probabilities for the unseen test set. The predict method of LDA is used to predict the topic probabilities.
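To give an intuition for how a dot product prediction differs from Gibbs sampling, the base R sketch below shows our understanding of the idea: convert the topic-term distribution into P(topic | term) with Bayes' rule, then multiply the normalized term counts of a new document by it. This is an illustration with toy numbers, not textmineR's actual implementation; the uniform topic prior `p_topic` and all variable names are assumptions.

```r
# Sketch of a dot-product style prediction (base R, toy numbers).
# phi: topics x terms, P(term | topic), rows sum to 1
phi <- matrix(c(0.7, 0.2, 0.1,
                0.1, 0.3, 0.6), nrow = 2, byrow = TRUE,
              dimnames = list(c("t1", "t2"), c("virus", "vaccine", "market")))
p_topic <- c(0.5, 0.5)             # assumed uniform prior over topics

# Bayes' rule: gamma[k, w] = P(topic k | term w); each column sums to 1
joint <- phi * p_topic             # row k is scaled by p_topic[k]
gamma <- sweep(joint, 2, colSums(joint), "/")

# New documents as raw term counts, normalized to term proportions
new_dtm <- matrix(c(3, 1, 0,
                    0, 1, 4), nrow = 2, byrow = TRUE,
                  dimnames = list(c("doc_a", "doc_b"), colnames(phi)))
new_prop <- new_dtm / rowSums(new_dtm)

# Topic probabilities for the new documents: one matrix product, no sampling
theta_new <- new_prop %*% t(gamma)
```

Because each column of `gamma` sums to 1 and each row of `new_prop` sums to 1, every row of `theta_new` is itself a probability distribution over topics. The appeal of this approach is speed: it needs no iterations or burn-in.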
Predicting the topics using Term frequency with Gibbs sampling model
p1_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow),], method = "gibbs",iterations = 200, burnin = 175)
The probability distribution of the topics in the train & test set for the Term frequency with Gibbs sampling model can be seen in the plot below.
…
Predicting the topics using Term frequency with the dot product model
p1_bow <- predict(LDA_model_bow, sparse_matrix_dtm_bow[17:nrow(sparse_matrix_dtm_bow),], method = "dot",iterations = 200, burnin = 175)
The probability distribution of the topics in the train & test set for the Term frequency with dot product model can be seen in the plot below.
Perplexity Score
The next step is to evaluate the model, for which we used the log likelihood. The higher the value, the better the model. The plot below shows the log likelihood for the two models.
From the plot it is evident that the Bag of Words model (term frequency) performs better, and hence we have used this model from here on.
Once the prediction is done, we have topic probabilities for all the documents. It is interesting to find similarities between topics, so we clustered the documents based on their topic probabilities.
To perform clustering, we need to decide on the optimal number of clusters. This was determined using an elbow curve. The optimal number of clusters according to the elbow curve is 8.
#Reducing the dimensions via tsne
tsne <- Rtsne(doc_topics_gamma[,-1], perplexity = 30, pca = FALSE, check_duplicates = FALSE)
X <- data.frame(tsne$Y)
#Find best no. of clusters for 25 topics
wss <- (nrow(X)-1)*sum(apply(X,2,var))
for (i in 1:100) wss[i] <- sum(kmeans(X,iter.max = 50L,centers=i)$withinss)
plot(1:100, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
Another approach used to find the optimal number of clusters was the silhouette coefficient. For each point, it compares the average distance to the other points in its own cluster (intra-cluster distance) with the average distance to the points in the nearest other cluster (inter-cluster distance). We evaluated this value for 8 and 15 clusters, and the results can be seen in the plots below.
The silhouette coefficient for our clustering was 0.33. Given that all the documents in our dataset talk about coronavirus, it is no wonder that the silhouette coefficient is low: the clusters lie close to one another, and so do the documents within them.
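To make the silhouette coefficient concrete, here is a minimal base R sketch that computes the mean silhouette width by hand. The function name `mean_silhouette` and the two toy layouts are hypothetical; in our setting the points would be the t-SNE coordinates in X and the labels would come from the k-means result.

```r
# Mean silhouette width in base R (expects at least two clusters).
# x: numeric matrix of points; cluster: cluster label for each row of x
mean_silhouette <- function(x, cluster) {
  d <- as.matrix(dist(x))                     # pairwise Euclidean distances
  labels <- unique(cluster)
  s <- vapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)              # convention for singleton clusters
    a <- sum(d[i, own]) / (sum(own) - 1)      # mean distance within own cluster
    b <- min(vapply(setdiff(labels, cluster[i]), function(l) {
      mean(d[i, cluster == l])                # mean distance to another cluster
    }, numeric(1)))                           # b: nearest other cluster
    (b - a) / max(a, b)
  }, numeric(1))
  mean(s)
}

set.seed(1)
# Two toy 2-D layouts standing in for t-SNE coordinates
separated <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
                   matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
overlapping <- rbind(matrix(rnorm(40, mean = 0, sd = 1), ncol = 2),
                     matrix(rnorm(40, mean = 1, sd = 1), ncol = 2))
labels <- rep(1:2, each = 20)

sil_separated   <- mean_silhouette(separated, labels)
sil_overlapping <- mean_silhouette(overlapping, labels)
```

Well-separated groups score close to 1, while groups that overlap score much lower, which matches the intuition for a corpus in which every document covers the same broad subject.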
Finally, the articles were grouped into 8 clusters.
k3 <- kmeans(X,centers = 8, nstart = 5,iter.max = 100000L)
fviz_cluster(k3,X)
Convex Hull Plot for 8 clusters
It would be interesting to see how the topics are associated with the clusters. The chord diagram shows the association of each topic with the clusters.
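One plausible way to quantify such an association is the mean topic probability per cluster. The base R sketch below uses toy values; the exact measure behind our chord diagram is not reproduced here, so treat this as an assumption, and all variable names are hypothetical.

```r
# theta: documents x topics (toy values; each row sums to 1)
theta <- matrix(c(0.8, 0.1, 0.1,
                  0.7, 0.2, 0.1,
                  0.1, 0.8, 0.1,
                  0.2, 0.1, 0.7,
                  0.1, 0.2, 0.7), ncol = 3, byrow = TRUE,
                dimnames = list(NULL, c("topic_1", "topic_2", "topic_3")))
cluster <- c(1, 1, 2, 2, 2)     # cluster id per document, e.g. from kmeans

# Sum the topic probabilities within each cluster, then divide by cluster size
cluster_sizes <- as.vector(table(cluster))
association <- rowsum(theta, group = cluster) / cluster_sizes
```

Each row of `association` is the average topic mix of one cluster, so the rows still sum to 1. A clusters-by-topics matrix of this shape is the kind of input a chord diagram is typically drawn from.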
The entire document corpus has been visualized in the RBokeh graph. Hovering over a document displays its title, URL, and most dominant topic. It can be seen that documents belonging to the same topic lie relatively close to each other, although some exceptions exist near the boundaries of each topic.
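The most dominant topic of a document can be read off its topic probabilities by taking the topic with the highest value. A minimal base R sketch with toy values follows; the matrix `theta` and the column names are hypothetical.

```r
# theta: documents x topics (toy values; each row sums to 1)
theta <- matrix(c(0.70, 0.20, 0.10,
                  0.05, 0.15, 0.80,
                  0.30, 0.45, 0.25), ncol = 3, byrow = TRUE,
                dimnames = list(c("doc_1", "doc_2", "doc_3"),
                                c("t_1", "t_2", "t_3")))

# Column index of the largest probability in each row
dominant_idx <- max.col(theta, ties.method = "first")

dominant <- data.frame(
  document       = rownames(theta),
  dominant_topic = colnames(theta)[dominant_idx],
  probability    = theta[cbind(seq_len(nrow(theta)), dominant_idx)],
  stringsAsFactors = FALSE
)
```

Keeping the winning probability next to the label is useful: documents near a topic boundary tend to have a low maximum, which helps explain the exceptions seen in the graph.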